{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Housing Prices (1).ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "display_name": "Python 2",
      "language": "python",
      "name": "python2"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 2
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython2",
      "version": "2.7.10"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tQgn5l2TY-r4",
        "colab_type": "text"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Authors.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_4XqvudaY969",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MeBMw6qzY5Gv",
        "colab_type": "text"
      },
      "source": [
        "# Feature columns visualization\n",
        "\n",
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github.com/tensorflow/examples/blob/master/community/en/hashing_trick.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />\n",
        "    Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/tree/master/community/en/hashing_trick.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />\n",
        "    View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8lRXsjKCY5Gw",
        "colab_type": "text"
      },
      "source": [
        "This example demonstrates the use `tf.feature_column.crossed_column` on some simulated Atlanta housing price data. \n",
        "This spatial data is used primarily so the results can be easily visualized. \n",
        "\n",
        "These functions are designed primarily for categorical data, not to build interpolation tables. \n",
        "\n",
        "If you actually want to build smart interpolation tables in TensorFlow you may want to consider [TensorFlow Lattice](https://research.googleblog.com/2017/10/tensorflow-lattice-flexibility.html)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "yEHBFimYk-Mu"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "DiAklWTFk-My",
        "colab": {}
      },
      "source": [
        "import os\n",
        "import subprocess\n",
        "import tempfile\n",
        "\n",
        "import tensorflow as tf\n",
        "import numpy as np\n",
        "import matplotlib as mpl\n",
        "import matplotlib.pyplot as plt"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "OvXAbgwXY5G4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "mpl.rcParams['figure.figsize'] = 12, 6\n",
        "mpl.rcParams['image.cmap'] = 'viridis'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WFYz5eg1k-M7"
      },
      "source": [
        "## Build Synthetic Data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "F3ouYc9N9zW3",
        "colab": {}
      },
      "source": [
        "# Define the grid\n",
        "min_latitude = 33.641336\n",
        "max_latitude = 33.887157\n",
        "delta_latitude = max_latitude-min_latitude\n",
        "\n",
        "min_longitude = -84.558798\n",
        "max_longitude = -84.287259\n",
        "delta_longitude = max_longitude-min_longitude\n",
        "\n",
        "resolution = 100"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rGJxtBziY5HI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Use RandomState so the behavior is repeatable. \n",
        "R = np.random.RandomState(1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "JQ7wXpURk-M8",
        "colab": {}
      },
      "source": [
        "# The price data will be a sum of Gaussians, at random locations.\n",
        "n_centers = 20\n",
        "centers = R.rand(n_centers, 2)  # shape: (centers, dimensions)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "nR1wQiqSk-NA",
        "colab": {}
      },
      "source": [
        "# Each Gaussian has a maximum price contribution, at the center.\n",
        "# Price_\n",
        "price_delta = 0.5+2*R.rand(n_centers)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "rGX3j-DOk-NC",
        "colab": {}
      },
      "source": [
        "# Each Gaussian also has a standard-deviation and variance.\n",
        "std = 0.2*R.rand(n_centers)  # shape: (centers)\n",
        "var = std**2"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "kWu2ba4Tk-NE",
        "colab": {}
      },
      "source": [
        "def price(latitude, longitude):\n",
        "    # Convert latitude, longitude to x,y in [0,1]\n",
        "    x = (longitude - min_longitude)/delta_longitude\n",
        "    y = (latitude - min_latitude)/delta_latitude\n",
        "    \n",
        "    # Cache the shape, and flatten the inputs.\n",
        "    shape = x.shape\n",
        "    assert y.shape == x.shape\n",
        "    x = x.flatten()\n",
        "    y = y.flatten()\n",
        "    \n",
        "    # Convert x, y examples into an array with shape (examples, dimensions)\n",
        "    xy = np.array([x,y]).T\n",
        "\n",
        "    # Calculate the square distance from each example to each center.  \n",
        "    components2 = (xy[:,None,:] - centers[None,:,:])**2  # shape: (examples, centers, dimensions)\n",
        "    r2 = components2.sum(axis=2)  # shape: (examples, centers)\n",
        "    \n",
        "    # Calculate the z**2 for each example from each center.\n",
        "    z2 = r2/var[None,:]\n",
        "    price = (np.exp(-z2)*price_delta).sum(1)  # shape: (examples,)\n",
        "    \n",
        "    # Restore the original shape.\n",
        "    return price.reshape(shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "BPoSndW8k-NG",
        "colab": {}
      },
      "source": [
        "# Build the grid. We want `resolution` cells between `min` and `max` on each dimension\n",
        "# so we need `resolution+1` evenly spaced edges. The centers are at the average of the\n",
        "# upper and lower edge. \n",
        "\n",
        "latitude_edges = np.linspace(min_latitude, max_latitude, resolution+1)\n",
        "latitude_centers = (latitude_edges[:-1] + latitude_edges[1:])/2\n",
        "\n",
        "longitude_edges = np.linspace(min_longitude, max_longitude, resolution+1)\n",
        "longitude_centers = (longitude_edges[:-1] + longitude_edges[1:])/2\n",
        "\n",
        "latitude_grid, longitude_grid = np.meshgrid(\n",
        "    latitude_centers,\n",
        "    longitude_centers)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "0Y5fSCpWk-NI",
        "colab": {}
      },
      "source": [
        "# Evaluate the price at each center-point\n",
        "actual_price_grid = price(latitude_grid, longitude_grid)\n",
        "\n",
        "price_min = actual_price_grid.min()\n",
        "price_max = actual_price_grid.max()\n",
        "price_mean = actual_price_grid.mean()\n",
        "price_mean"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "wN2wMUOck-NK",
        "colab": {}
      },
      "source": [
        "def show_price(price):\n",
        "    plt.imshow(\n",
        "        price, \n",
        "        # The color axis goes from `price_min` to `price_max`.\n",
        "        vmin=price_min, vmax=price_max,\n",
        "        # Put the image at the correct latitude and longitude.\n",
        "        extent=(min_longitude, max_longitude, min_latitude, max_latitude), \n",
        "        # Make the image square.\n",
        "        aspect = 1.0*delta_longitude/delta_latitude)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "FkHXxJGuAsC1",
        "colab": {}
      },
      "source": [
        "show_price(actual_price_grid)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "miDtqLRek-NM"
      },
      "source": [
        "## Build Datasets"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "zqBPKjBvk-NM",
        "colab": {}
      },
      "source": [
        "# For test data we will use the grid centers.\n",
        "test_features = {'latitude':latitude_grid.flatten(), 'longitude':longitude_grid.flatten()}\n",
        "test_ds = tf.data.Dataset.from_tensor_slices((test_features, \n",
        "                                           actual_price_grid.flatten()))\n",
        "test_ds = test_ds.cache().batch(512).prefetch(1)\n",
        "\n",
        "# For training data we will use a set of random points.\n",
        "train_latitude = min_latitude + np.random.rand(50000)*delta_latitude\n",
        "train_longitude = min_longitude + np.random.rand(50000)*delta_longitude\n",
        "train_price = price(train_latitude, train_longitude)\n",
        "\n",
        "train_features = {'latitude':train_latitude, 'longitude':train_longitude}\n",
        "train_ds = tf.data.Dataset.from_tensor_slices((train_features, train_price))\n",
        "train_ds = train_ds.cache().shuffle(100000).batch(512).prefetch(1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "i-ToQzSqk-NO"
      },
      "source": [
        "## Generate a plot from an Estimator"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tA4-qkPwcpzN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "ag = actual_price_grid.reshape(resolution, resolution)\n",
        "ag.shape"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "N9NmQLUOk-NP",
        "colab": {}
      },
      "source": [
        "def plot_model(model, ds = test_ds):\n",
        "    # Create two plot axes\n",
        "    actual, predicted = plt.subplot(1,2,1), plt.subplot(1,2,2)\n",
        "\n",
        "    # Plot the actual price.\n",
        "    plt.sca(actual)\n",
        "    plt.pcolor(actual_price_grid.reshape(resolution, resolution))\n",
        "    \n",
        "    # Generate predictions over the grid from the estimator.\n",
        "    pred =  model.predict(ds)\n",
        "    # Convert them to a numpy array.\n",
        "    pred = np.fromiter((item for item in pred), np.float32)\n",
        "    # Plot the predictions on the secodn axis.\n",
        "    plt.sca(predicted)\n",
        "    plt.pcolor(pred.reshape(resolution, resolution))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d3e8Fhrugua3",
        "colab_type": "text"
      },
      "source": [
        "## A linear regressor"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "y5JKwo42hmAB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Use `normalizer_fn` so that the model only sees values in [0, 1]\n",
        "norm_latitude = lambda latitude:(latitude-min_latitude)/delta_latitude - 0.5\n",
        "norm_longitude = lambda longitude:(longitude-min_longitude)/delta_longitude - 0.5\n",
        "\n",
        "linear_fc = [tf.feature_column.numeric_column('latitude', normalizer_fn = norm_latitude), \n",
        "             tf.feature_column.numeric_column('longitude', normalizer_fn = norm_longitude)]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3jMhB4gHg8H1",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Build and train the model\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.DenseFeatures(feature_columns=linear_fc),\n",
        "    tf.keras.layers.Dense(1),\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.MeanAbsoluteError())\n",
        "\n",
        "model.fit(train_ds, epochs=200, validation_data=test_ds)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8-V8sYmPhJoL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "plot_model(model)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gfgngu0lk-NQ"
      },
      "source": [
        "## A DNN regressor\n",
        "Important: Pure categorical data doesn't the spatial relationships that make this example possible. Embeddings are a way your model can learn spatial relationships."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "On-_j4Jtk-NR",
        "colab": {}
      },
      "source": [
        "# Build and train the model\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.DenseFeatures(feature_columns=linear_fc),\n",
        "    tf.keras.layers.Dense(100, activation='elu'),\n",
        "    tf.keras.layers.Dense(100, activation='elu'),\n",
        "    tf.keras.layers.Dense(1),\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.MeanAbsoluteError())\n",
        "\n",
        "model.fit(train_ds, epochs=200, validation_data=test_ds)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "WTcIX78Tk-NV",
        "colab": {}
      },
      "source": [
        "plot_model(model)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WIlTd5VEk-NZ"
      },
      "source": [
        "# Linear model with buckets"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "NdKt4g2Kk-NZ",
        "colab": {}
      },
      "source": [
        "# Bucketize the latitude and longitude usig the `edges`\n",
        "latitude_bucket_fc = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column('latitude'), \n",
        "    list(latitude_edges))\n",
        "\n",
        "longitude_bucket_fc = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column('longitude'),\n",
        "    list(longitude_edges))\n",
        "\n",
        "seperable_fc = [\n",
        "    latitude_bucket_fc,\n",
        "    longitude_bucket_fc]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZYEDiuVyiXx4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Build and train the model\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.DenseFeatures(feature_columns=seperable_fc),\n",
        "    tf.keras.layers.Dense(1),\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.MeanAbsoluteError())\n",
        "\n",
        "model.fit(train_ds, epochs=200, validation_data=test_ds)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "j4lhsk5rk-Nc",
        "colab": {}
      },
      "source": [
        "plot_model(model)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "G-QLbA0hKBbY"
      },
      "source": [
        "# Using `crossed_column` on its own."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "JoIXtYykKJei",
        "colab": {}
      },
      "source": [
        "# Cross the bucketized columns, using 5000 hash bins (for an average weight sharing of 2).\n",
        "crossed_lat_lon_fc = tf.feature_column.crossed_column(\n",
        "    [latitude_bucket_fc, longitude_bucket_fc], 2000)\n",
        "crossed_lat_lon_fc = tf.feature_column.indicator_column(crossed_lat_lon_fc)\n",
        "\n",
        "\n",
        "crossed_fc = [crossed_lat_lon_fc]\n",
        "\n",
        "# Build and train the model\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.DenseFeatures(feature_columns=crossed_fc),\n",
        "    tf.keras.layers.Dense(1),\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.MeanAbsoluteError())\n",
        "\n",
        "model.fit(train_ds, epochs=200, validation_data=test_ds)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "R-itu9itLe0K",
        "scrolled": false,
        "colab": {}
      },
      "source": [
        "plot_model(model)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KHJ_KsRUk-Nj"
      },
      "source": [
        "# Using raw categories with `crossed_column` \n",
        "The model generalizes better if it also has access to the raw categories, outside of the cross. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "ukHo6NrTk-Nk",
        "colab": {}
      },
      "source": [
        "# Build and train the model\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.DenseFeatures(feature_columns=crossed_fc+seperable_fc),\n",
        "    tf.keras.layers.Dense(1, kernel_regularizer=tf.keras.regularizers.l2(0.0001)),\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adam(),\n",
        "              loss=tf.keras.losses.MeanAbsoluteError(),\n",
        "              metrics=[tf.keras.losses.MeanAbsoluteError()])\n",
        "\n",
        "model.fit(train_ds, epochs=200, validation_data=test_ds)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "QjOwalvDk-Nm",
        "colab": {}
      },
      "source": [
        "plot_model(model)"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
